ABSTRACT

A multiprocessor system includes a plurality of central
processing units (CPUs) connected to one another by a
system bus. Each CPU includes a cache controller to
communicate with its cache, and a primary memory controller to
communicate with its primary memory. When there is a
cache miss in a CPU, the cache controller routes an address
request for primary memory directly to the primary memory
via the CPU as a speculative request without accessing the
system bus, and also issues the address request to the system
bus to facilitate data coherency. The speculative request is
queued in the primary memory controller, which in turn
retrieves speculative data from a specified primary memory
address. The CPU monitors the system bus for a subsequent
transaction that requests the specified data in the primary
memory. If the subsequent transaction requesting the specified
data is a read transaction that corresponds to the
speculative address request, the speculative request is
validated and becomes non-speculative. If, on the other hand,
the subsequent transaction requesting the specified data is a
write transaction, the speculative request is canceled.

19 Claims, 6 Drawing Sheets 
METHOD TO REDUCE MEMORY LATENCIES BY PERFORMING TWO
LEVELS OF SPECULATION

BACKGROUND

1. Field of Invention

This invention relates generally to superscalar processors,
and specifically to improving memory latencies of superscalar
processors in multi-processor systems.

2. Description of Related Art

Modern computer systems utilize a hierarchy of memory
elements in order to realize an optimum balance between the
speed, size, and cost of computer memory. These computer
systems typically employ a Dynamic Random Access
Memory (DRAM) as primary memory and include a larger,
but much slower, secondary memory such as, for instance,
a magnetic storage device or Compact Disc Read Only
Memory (CD ROM). A small, fast Static Random Access
Memory (SRAM) cache memory is typically provided
between the central processing unit (CPU) and primary
memory. This fast cache memory increases the data bandwidth
of the computer system by storing information most
frequently needed by the CPU. In this manner, information
most frequently requested during execution of a computer
program may be rapidly provided to the CPU from the
SRAM cache memory, thereby eliminating the need to
access the slower primary and secondary memories.
Although fast, the SRAM cache memory is very expensive
and is therefore typically small to minimize costs.

To further increase performance, high-end computer systems
may employ multiple central processing units (CPUs)
operating in parallel to allow for the simultaneous execution
of multiple instructions of a computer program. FIG. 1
illustrates a non-uniform memory architecture 1 having four
CPU blocks 10A-10D each connected to a system bus 11.
Referring also to FIG. 2, each CPU block 10 includes a CPU
20, an external or L2 cache 24, and a primary memory 25.
The external cache 24 (E$) is typically an SRAM device,
[illegible].
Each CPU 20 includes an external cache controller 21 to
interface with its external cache 24, and includes a primary
memory controller 22 to interface with its primary memory
25. Each CPU 20 also includes a bus interface unit 23 to
interface with the system bus 11. Although not shown for
simplicity in FIGS. 1 and 2, each CPU 20 includes an
internal or L1 cache, which in turn is typically divided into
an instruction cache and a data cache. The instruction cache
(I$) stores frequently executed instructions, and the data
cache (D$) stores frequently used data.

FIG. 1 shows additional devices connected to system bus
11. A secondary memory device 12 such as, for instance, a
hard-disk or tape drive, provides additional memory. A
monitor 13 provides users a graphical interface with system
1. CPU blocks 10A-10D are connected to a network 14
(e.g., a local area network, a wide area network, a virtual
private network, or the Internet) via system bus 11.

During execution of a computer program, the computer
program instructs the various CPUs 20 of system 1 to [illegible]
instructions by incrementing program counters within the
various CPUs 20 (program counters not shown for
simplicity). In response thereto, each CPU 20 fetches the
instructions identified by the computer program. If an
instruction requests data, an address request [illegible]
corresponding CPU 20 first
searches its internal cache for the data. If the specified data
is found in the L1 cache, the data is returned
to the CPU 20 for processing.
If, on the other hand, the specified data is not found in the
internal cache, the external cache 24 is then searched. If the
specified data is found in the external cache 24, it is
returned to the CPU 20 for processing. If the specified data is
not in the external cache 24, the address request is
forwarded to the system bus 11, which in turn provides the
address request to the primary memory 25 of each of the
various CPUs 20 via respective primary memory controllers
22.

When running a computer program on a multiprocessor
system, [illegible] instructions which
[illegible] CPUs. As a result, it is necessary to monitor
[illegible] assigned to the various CPUs in order to maintain
data coherency. For example, if a first instruction
[illegible] subsequent instruction executed by CPU
[illegible] instruction must be executed before the second
instruction [illegible] instruction is modified
[illegible] instruction.

Data coherency is typically maintained in a multiprocessor
system such as the system shown in FIGS. 1 and 2 by
first issuing all primary memory address requests to the
system bus 11, irrespective of whether a particular address
request is to the executing CPU's own primary memory (a
local request) or to another CPU's primary memory (a
remote request). This ensures that instructions are executed
by the various CPUs in the order that they were issued to the
system bus 11, the order of which [illegible]
instruction order of the computer program. Thus, for
instance, in response to a cache miss a CPU 20
forwards the primary memory address request to the system
bus 11. Once issued on the system bus 11, the address
request is available to all CPUs 20. The CPU [illegible]
the address request [illegible] the address request from
the system bus 11, and thereafter searches its primary
memory 25 for the specified data. [illegible]
monitor the address request issued on the system bus 11 to
generate well-known snoop information for the requesting
CPU. Snoop information maintains the consistency
between the various CPUs 20 by indicating whether data
specified by the address request has been modified while
stored in the cache of another CPU.

Routing an address request to primary memory 25 via the
system bus 11 in response to a cache miss advantageously
maintains proper data coherency in a multiprocessor system.
However, routing an address request from a CPU's cache to
the CPU's own primary memory 25 via the system bus 11
[illegible] bus 11. The multiple
connections to the system bus 11 result in a relatively large
[illegible]
cause significant delays in arbitrating access to the system
bus. [illegible] arbitration delays undesirably increase the
[illegible] 25. Since primary memory
[illegible] increasing as quickly as are CPU
processing speeds, it is becoming increasingly important to
reduce primary memory latencies in order to maximize CPU
performance. Indeed, it would be highly desirable to
improve primary memory latencies in a multiprocessor
computer system while ensuring data coherency.

SUMMARY

A method is disclosed that reduces memory latencies in a
multiprocessor computer system over the prior art. A
multiprocessor system includes a plurality of central processing
units (CPUs) connected to one another by a system bus.
Each CPU includes a cache controller to communicate with
its cache, and a primary memory controller to communicate
with its primary memory. In accordance with the present
invention, when there is a cache miss in a CPU, the cache
controller routes an address request for primary memory
directly to the primary memory via the CPU as a speculative
request, and also issues the address request to the system bus
to maintain data coherency. The speculative request is
queued in the primary memory controller, and thereafter
retrieves speculative data from a specified primary memory
address. The CPU monitors the system bus for requests to
the specified primary memory address. If a subsequent
transaction requesting the specified data is the read transaction
that was issued on the system bus in response to the cache
miss, the speculative request and any data retrieved thereby
is validated and becomes non-speculative. If, on the other
hand, the subsequent transaction requesting the specified
data is a write transaction, the speculative request is
canceled. The write transaction for the specified data is
processed before the read transaction [illegible]
in order to maintain data coherency.

Thus, in contrast to prior art architectures [illegible] access
to primary memory [illegible]
established on the system bus, present embodiments, in
response to a cache miss, route a speculative request
directly from a CPU's cache controller to primary
memory, without arbitrating access to the system bus. By
accessing primary memory contemporaneously with [illegible]
bus access and then accessing primary memory, present
embodiments [illegible] reduce primary memory latencies,
in turn improving CPU performance.
FIG. 1 is a block diagram of a conventional multiprocessor
computer system;

FIG. 2 is a block diagram of a CPU block of the computer 
system of FIG. 1; 

FIG. 3 is a block diagram of a multiprocessor computer 
system incorporating CPU blocks in accordance with the 
present invention; 

FIG. 4 is a block diagram of a CPU block of the computer
system of FIG. 3 in accordance with one embodiment of the 
present invention; and 

FIG. 5 is a flow chart illustrating the operation of the
system of FIG. 4 according to one embodiment of the 
present invention. 

Like reference numerals refer to corresponding parts 
throughout the drawing figures. 
DETAILED DESCRIPTION

FIG. 3 illustrates a multiprocessor computer system 100 
having four CPUs 101A-101D constructed in accordance
with the present invention. The CPUs 101A-101D are
connected to primary memories 102A-102D, respectively,
via dedicated signal lines 103, and to the system bus 11 via
signal lines 104. Primary memory 102 is preferably DRAM, 
although any well-known memory device may be used. 
System bus 11 provides connections to the secondary 
memory 12, the monitor 13, and the network 14 in a 
well-known manner. 

Referring also to FIG. 4, each CPU 101 includes an 
external cache (E$) controller 110, a memory interface unit
120, and a bus interface unit 130. External cache controller
110 includes an E$ address fetch circuit 111 to communicate 
with the external cache 24 via dedicated bus 112. The E$
address fetch circuit 111 queries the external cache 24 for
a specified address requested by an instruction executed by 
the CPU 101. The external cache 24 is preferably an SRAM 
cache device. Although shown as being external to CPU 
101, in some embodiments the external cache 24 may be 
included on the CPU 101. 

The memory interface unit 120 is shown to include a 
memory write queue (MWQ) 121, a memory read queue 
(MRQ) 122, a primary memory controller 123, and an 
output data buffer 124. The MWQ 121 and MRQ 122 are 
preferably Content addressable Memory-Random Access 
Memory (CAMRAM) devices. The MWQ 121 communi- 
cates with execution units within the CPU 101 via signal line 
125, and the MRQ 122 communicates with the CPU execu- 
tion units via signal line 126 (the CPU execution units are 
not shown for simplicity). In one embodiment, MWQ 121 
stores 4 write requests, and MRQ 122 stores 16 read 
requests. 

As mentioned above, multiprocessor computer systems 
simultaneously process multiple instructions of a computer 
program by delegating execution of the instructions amongst 
the various CPUs of the system. In accordance with the
present invention, the multiprocessor computer system
shown in FIGS. 4 and 5 reduces primary memory latencies
over prior art multiprocessor architectures by routing
speculative address requests directly from the external cache
controller 110 to the memory interface unit 120 of the CPU
101 in response to external cache misses. The memory
interface unit 120 may begin processing speculative address
requests before data coherency information is available from
the system bus 11.

In accordance with the present invention, the speculative 
address requests received directly from the external cache
controller 110 are reconciled with data coherency information
provided by the system bus 11. Data coherency is
maintained on the system bus 11 by issuing all address
requests from the external cache controller 110 to the system
bus 11. Thus, for instance, if a write transaction for a
specified address precedes a read transaction to the specified
address in the execution order of the computer program, the
write transaction is issued to the system bus 11 before the
read transaction is issued to the system bus 11, irrespective
of whether the two transactions are executed by the same
CPU 101 or by different CPUs 101. Thus, the CPUs 101
monitor the system bus 11 for requests corresponding to
speculative address requests being processed in respective
memory interface units 120. If the CPU 101 receives from
the system bus 11 a write request to an address specified by
a speculative request, the speculative address request, and
any data retrieved thereby, are canceled. If, on the other
hand, the CPU 101 receives from the system bus 11 an
address request that corresponds to the speculative request
before receiving a write-back transaction for the specified
address, the speculative request is validated and becomes a
non-speculative address request. If speculative data has
already been retrieved from primary memory 25, the
speculative data is validated and thereafter processed by the CPU
101 in a conventional manner.

The CPUs 101 also monitor the system bus 11 for snoop
information provided by the other CPUs 101. If the snoop
information indicates that specified data [illegible]
is stale, all corresponding address requests to primary
memory, and any data retrieved thereby, are canceled, and the
CPU 101 [illegible]
updated data to the appropriate [illegible].
Conversely, if the snoop information indicates that
specified data is clean, the address request is
completed, and data retrieved thereby is processed
in a conventional manner. When this condition occurs,
primary memory latency is reduced over prior art
multiprocessor computer architectures.

In sum, when there is an external cache miss, the address 
request is routed directly from the external cache controller 
110 to the primary memory interface unit 120 as a specu- 
lative address request, thereby avoiding any delays associ- 
ated with arbitrating access to the system bus 11. Thus, while 
the address request issued to the system bus 11 waits for 
access to the system bus 11, its corresponding speculative 
address request may be processed in the memory interface 
unit 120. In this manner, present embodiments may overlap 
primary memory latency with bus arbitration delays to 
improve CPU performance. Present embodiments achieve 
maximum performance gains when data corresponding to 
speculative address requests is retrieved from primary 
memory as or before bus ordering information and snoop 
information become available. 
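
By way of illustration only, the benefit of overlapping the two delays can be sketched with a simplified timing model. The model assumes that the bus delay (arbitration plus ordering and snoop results) and the primary memory access are each a single fixed interval; the function names and the nanosecond figures are hypothetical and do not appear in the drawings.

```python
def serialized_latency(bus_delay_ns: float, memory_access_ns: float) -> float:
    """Conventional ordering: wait for the bus, then access primary memory."""
    return bus_delay_ns + memory_access_ns


def overlapped_latency(bus_delay_ns: float, memory_access_ns: float) -> float:
    """Speculative ordering: both proceed in parallel, so the data is usable
    once the slower of the two has finished."""
    return max(bus_delay_ns, memory_access_ns)


# Hypothetical figures, chosen only to illustrate the arithmetic.
BUS_DELAY_NS = 80.0
MEMORY_ACCESS_NS = 60.0

saved_ns = serialized_latency(BUS_DELAY_NS, MEMORY_ACCESS_NS) - overlapped_latency(
    BUS_DELAY_NS, MEMORY_ACCESS_NS
)
```

Under this model the saving equals the shorter of the two intervals, which is why the gain is greatest when the memory access completes as or before the bus information arrives.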

The foregoing operations are more fully appreciated with 
reference to FIG. 5. The first processing step shown in FIG. 
5 is the generation of an external cache miss in response to 
an address request from the CPU (step 70). In some 
circumstances, the address request is accompanied by a 
write request, in which case the primary memory controller
123 reads data from primary memory 25 and writes data to
primary memory 25. In response to the external cache miss, 
the external cache address fetch circuit 111 routes the 
corresponding speculative address request to the memory 
interface unit 120 via line 114, and also routes the address 
request to the system bus 11 via line 115 (step 71). As 
mentioned above, the address request is routed to the system 
bus 11 to facilitate bus ordering and snooping functions, and
the speculative address request is routed directly to and 
queued within the MRQ 122 to fetch speculative data. The 
speculative address request, as well as any data retrieved 
thereby from the primary memory 25, remains speculative 
until validated by data coherency information provided by 
the system bus 11. In some embodiments, if the MRQ 122 
is full when a speculative address request is received from 
the external cache controller 110, the speculative address 
request is dropped. 
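
Steps 70 and 71 may be sketched in software as follows. This is a minimal model under stated assumptions, not a description of the actual circuitry: the class and function names are hypothetical, the 16-entry capacity is taken from the one embodiment described above, and the sketch assumes that the address request is still issued to the system bus when the speculative copy is dropped.

```python
from dataclasses import dataclass, field
from typing import List

MRQ_CAPACITY = 16  # one embodiment stores 16 read requests


@dataclass
class ReadRequest:
    address: int
    valid: bool = False  # un-asserted while the request is speculative


@dataclass
class MemoryReadQueue:
    capacity: int = MRQ_CAPACITY
    entries: List[ReadRequest] = field(default_factory=list)

    def enqueue_speculative(self, address: int) -> bool:
        """Queue a speculative read; return False if it had to be dropped."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(ReadRequest(address))
        return True


def handle_external_cache_miss(address: int, mrq: MemoryReadQueue,
                               bus_requests: List[int]) -> bool:
    """Step 71: route the request both to the MRQ and to the system bus."""
    queued = mrq.enqueue_speculative(address)
    # Assumption: the bus request is issued even if the speculation was dropped.
    bus_requests.append(address)
    return queued
```

Dropping a speculative request when the queue is full costs only the latency benefit, since the copy issued to the system bus still reaches primary memory in the conventional way.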

Once issued on the system bus 11, address requests from 
the various CPUs 101A-101D are available to all the CPUs
101A-101D. Each CPU 101 receives a subsequent address
request from the system bus 11 (step 72). If the request is not 
to the receiving CPU's own primary memory 25, but rather 
requests data from another CPU's primary memory, as 
determined by step 73, processing proceeds to step 82. On 
the other hand, if the request is to the receiving CPU's own
primary memory, i.e., the request is local, the CPU compares 
the address request with entries stored in its MRQ 122 to 
maintain data coherency (step 74). 

Each address request entry queued in the MRQ 122 has an 
associated valid bit indicative of whether the address request 
has been validated by ordering information from the system 
bus. Thus, speculative address requests received directly 
from the external cache controller 110 in response to an 
external cache miss initially have an un-asserted valid bit, 
while address requests that are consistent with and thus are 
validated by data coherency information on the system bus 
11 have an asserted valid bit. 

If there is not a match in the MRQ 122, as tested by step 
75, thereby indicating that there is not a corresponding 
speculative address request, and the received request to local
memory is a read transaction, as tested by step 76, the 
address request is queued in the MRQ 122 (step 77). Here, 
since the transaction to be queued in the MRQ 122 is
received from the system bus 11, rather than from the
external cache controller 110, its consistency with bus 
ordering is validated and, therefore, its valid bit is asserted 
when queued in the MRQ 122. 

If, on the other hand, there is a match in the MRQ 122,
and the subsequent address request is a read transaction, as
tested by step 78, the valid bit of the matching address 
request queued in the MRQ 122 is asserted (if not asserted 
already) to indicate that ordering on the system bus 11 has 
been verified (step 79). In this manner, speculative address
requests received directly from the external cache controller
110 are validated by the matching read transaction received 
from the system bus 11 to become non-speculative address 
requests. 

If the request to local memory received from the system 
bus 11 is a write transaction, as tested by step 78, and the 
valid bit of the matching address request queued in the MRQ 
122 is un-asserted to indicate that the queued request is 
speculative, as tested in step 80, the speculative address 
request is canceled, and the write transaction is issued to 
primary memory 25 (step 81). In this manner, the write
transaction is issued to primary memory 25 prior to the 
matching read request to preserve data coherency. On the 
other hand, if the request queued in the MRQ 122 has an 
asserted valid bit, thereby indicating that the request has 
been validated by bus ordering, the queued address request 
remains queued in the MRQ 122, and processing proceeds 
to step 82. 
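
The decisions of steps 73 through 81 can be summarized in a short sketch. The model below is an illustration under stated assumptions: the MRQ is represented as a dictionary mapping each queued address to its valid bit, the function name and the returned labels are hypothetical, and the case of a write with no matching entry, which the description above does not detail, is modeled as leaving the MRQ unchanged.

```python
from typing import Dict


def reconcile_bus_request(mrq: Dict[int, bool], address: int,
                          is_read: bool, is_local: bool = True) -> str:
    """Apply one request received from the system bus to the MRQ model.

    mrq maps a queued address to its valid bit (False means speculative).
    """
    if not is_local:
        return "snoop_only"          # step 73: remote request, go to step 82
    if address not in mrq:           # step 75: no matching entry
        if is_read:                  # step 76
            mrq[address] = True      # step 77: queued already validated
            return "queued_valid"
        return "no_mrq_change"       # assumption: nothing to reconcile
    if is_read:                      # step 78: matching read
        mrq[address] = True          # step 79: assert the valid bit
        return "validated"
    if not mrq[address]:             # step 80: matching write, entry speculative
        del mrq[address]             # step 81: cancel; the write goes first
        return "canceled"
    return "kept"                    # validated entry stays queued
```

For example, a speculative entry for address 0x40 followed by a bus write to 0x40 yields "canceled", whereas the same entry followed by its own bus read yields "validated".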

It should be noted that the local request is also provided
to MWQ 121 via line 125 for comparison with write requests
queued in the MWQ 121. If there is a match, the write 
request queued in the MWQ 121 is immediately dispatched 
to the primary memory controller 123 to write new data to 
the specified memory location. This preserves bus ordering. 

In a preferred embodiment, each address request issued to
the system bus 11 has an associated transaction identification 
(ID), which in turn indicates which CPU issued the request. 
In such embodiments, each entry in the MRQ 122 includes 
a physical address CAM field and a transaction ID RAM
field. Speculative address requests routed directly from the
external cache controller 110 to the primary memory 25 do 
not include a transaction ID. When the CPU 101 receives an 
address request from the system bus 11, the request's 
transaction ID is used to either validate or to cancel corresponding
speculative address requests queued in the MRQ 122 in
a manner similar to that described above. In some 
embodiments, each entry in the MRQ 122 includes a snoop- 
status RAM field for holding snoop information gathered 
from the system bus 11. 
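
One possible layout of an MRQ entry is sketched below for illustration. The field and function names are hypothetical. The content-addressable lookup is modeled as a linear search by physical address, and recording the transaction ID at the moment an entry is validated is an assumption made for the sketch rather than a behavior stated above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MRQEntry:
    physical_address: int                 # CAM field: matched by content
    transaction_id: Optional[int] = None  # RAM field: absent while speculative
    valid: bool = False                   # asserted once bus ordering is verified
    snoop_status: Optional[str] = None    # RAM field used in some embodiments


def cam_lookup(entries: List[MRQEntry], physical_address: int) -> List[int]:
    """Return the indices of all entries whose CAM field matches the address."""
    return [i for i, e in enumerate(entries)
            if e.physical_address == physical_address]


def validate_from_bus(entries: List[MRQEntry], physical_address: int,
                      transaction_id: int) -> int:
    """Validate matching entries and record the bus transaction ID.

    Returns the number of entries that matched.
    """
    hits = cam_lookup(entries, physical_address)
    for i in hits:
        entries[i].transaction_id = transaction_id  # assumption: recorded here
        entries[i].valid = True
    return len(hits)
```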

In addition to selectively comparing requests to local
memory with entries in its MRQ 122, each CPU 101 
compares address requests received from the system bus 11 
with entries in its external cache 24 to generate snoop 
information for the system 100 (step 82). For each such
comparison, if there is not a match, as tested by step 83, the
CPU 101 forwards to the system bus 11 snoop information 
indicating that data specified by the address request is clean 
(step 84). Conversely, if there is a match in its external cache 
24, the CPU 101 determines whether the specified data has
been modified while stored in the external cache 24 (step
85). If the specified data has not been modified, the CPU 101 
forwards clean snoop information to the system bus 11 (step 
84). If, on the other hand, the specified data has been
modified while in the external cache 24, the CPU 101 
forwards to system bus 11 snoop information indicating that 
the specified data is dirty, and dispatches the updated specified
data to the system bus 11 (step 86).

The CPUs 101A-101D continually monitor the system
bus 11 for snoop information (step 87). If the snoop infor- 
mation indicates that specified data is dirty, as tested by step 
88, the corresponding address request queued in the MRQ
122 is canceled, and updated data corresponding to the
canceled address request is received from the system bus 11 
(step 89). If, on the other hand, the snoop information 
indicates that the specified data is clean, the CPU completes 
the address request and associated read operation (step 90). 
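
The snoop handling of steps 82 through 90 may be sketched as a pair of functions, one for the CPU that generates snoop information and one for the CPU that consumes it. This is a simplified model under stated assumptions: the external cache is represented as a dictionary mapping an address to a tuple of data and a modified flag, snoop information is a tuple of a status string and optional updated data, and all names are hypothetical.

```python
from typing import Dict, Optional, Tuple

CacheLine = Tuple[bytes, bool]          # (data, modified flag)
Snoop = Tuple[str, Optional[bytes]]     # ("clean" or "dirty", updated data)


def generate_snoop(external_cache: Dict[int, CacheLine], address: int) -> Snoop:
    """Steps 82 to 86: compare a bus request against the external cache."""
    line = external_cache.get(address)
    if line is None:
        return ("clean", None)          # steps 83 and 84: no match
    data, modified = line
    if not modified:
        return ("clean", None)          # steps 85 and 84: present, unmodified
    return ("dirty", data)              # step 86: forward the updated data


def resolve_read(snoop: Snoop, memory_data: bytes) -> bytes:
    """Steps 87 to 90: choose the data that the requesting CPU processes."""
    status, updated = snoop
    if status == "dirty":
        # Step 89: the memory copy is canceled in favor of the bus data.
        assert updated is not None
        return updated
    return memory_data                  # step 90: complete the read
```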

Although described above and depicted in the flow chart 
of FIG. 5 as comparing address requests received from the
system bus 11 with entries in the external cache 24 before 
comparing with entries in the MRQ 122, in some embodi- 
ments the comparisons are performed simultaneously. In 
other embodiments, address requests received from the 
system bus 11 are compared with entries in the MRQ 122
before being compared with entries in the external cache 24. 

The data buffer 124 within the memory interface unit 120 
stores data retrieved from the primary memory 25 for which 
(1) its corresponding address request remains speculative, 
i.e., not yet validated with respect to bus ordering informa- 
tion and (2) snoop information is not yet available. Thus, for 
instance, if a speculative address request queued in the MRQ 
122 is issued to the primary memory 25 before its valid bit 
is asserted by a corresponding read transaction on the system 
bus 11, the speculative data retrieved from the primary 
memory 25 remains in the data buffer 124 until either (1) a 
matching read transaction validates the data or (2) a match- 
ing write transaction cancels the data. If snoop results are not 
yet available for data retrieved from the primary memory 25, 
the data is stored in the data buffer 124 until either (1) clean 
snoop information is returned, in which case the data is 
forwarded to the CPU 101 via line 129 for processing or (2) 
dirty snoop information is returned, in which case the data 
is canceled. In one embodiment, the data buffer 124 is a 
RAM register file having 4 cache lines of data.
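
The conditions under which data leaves the data buffer 124 can be combined into a single decision, sketched below for illustration. The function name, its parameters, and the three returned labels are hypothetical, and the sketch assumes that data is forwarded only once it is both validated by bus ordering and reported clean by snoop information.

```python
from typing import Optional


def buffer_disposition(valid: bool, canceled_by_write: bool,
                       snoop: Optional[str]) -> str:
    """Decide what happens to one buffered cache line.

    snoop is None while snoop results are pending, else "clean" or "dirty".
    """
    if canceled_by_write or snoop == "dirty":
        return "discard"   # a matching write or dirty snoop cancels the data
    if valid and snoop == "clean":
        return "forward"   # sent to the CPU 101 for processing
    return "hold"          # still speculative, or snoop results pending
```

For instance, data whose request has been validated but whose snoop results are still pending remains held, and is forwarded only when clean snoop information arrives.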

While particular embodiments of the present invention 
have been shown and described, it will be obvious to those 
skilled in the art that changes and modifications may be 
made without departing from this invention in its broader
aspects and, therefore, the appended claims are to encom- 
pass within their scope all such changes and modifications as 
fall within the true spirit and scope of this invention. 
We claim:

1. A method of processing data in a multiprocessor 
system, the method comprising: 

generating an address request for a primary memory in 
response to a cache miss; 

routing the address request directly from the cache to the 
primary memory as a speculative request; 

issuing the address request to a system bus to establish 
data coherency information between the various pro- 
cessors; 

receiving the data coherency information from the system
bus; and 

selectively validating the speculative request in response 
to the data coherency information established on the 
system bus. 

2. The method of claim 1, wherein the speculative request
is validated by asserting one or more valid bits associated 
with the speculative request. 
3. The method of claim 1, further comprising:
retrieving from the system bus snoop information pro- 
vided by one or more of the processors in the system; 
and 

selectively canceling the speculative request in response 
to the snoop information. 

4. A method of processing data in a multiprocessor 
system, the method comprising: 

generating, within a selected one of a plurality of central 
processing units (CPUs) of the system, an address 
request for a primary memory in response to a cache 
miss; 

routing the address request directly from the cache to the 
primary memory via the selected CPU as a speculative 
request; 

issuing the address request to the plurality of CPUs via a
system bus connected between the plurality of CPUs;

receiving a subsequent request from the system bus into
the selected CPU;

if the subsequent request is the address request issued in
response to the cache miss, validating the speculative
request; and

if the subsequent request is a write request to an address 
specified by the speculative request, canceling the 
speculative request. 

5. The method of claim 4, further comprising: 
queuing the speculative request in a memory read queue
of the primary memory.

6. The method of claim 4, further comprising: 
forwarding the speculative request to the primary memory
to retrieve corresponding data.

7. The method of claim 6, further comprising: 

storing the corresponding data in a data buffer until 
validated by snoop information. 

8. A method of processing data in a multiprocessor 
system, the method comprising: 

generating an address request for a primary memory in
response to a cache miss;

issuing the address request from the cache to a system bus
as a transaction to facilitate data coherency on the
system bus;

routing the address request directly to the primary
memory as a speculative request;

queuing the speculative request in the primary memory
for retrieval of specified data from the primary
memory;

monitoring the system bus for the transaction correspond- 
ing to the speculative request; and 

canceling the speculative request if a write request cor- 
responding to the specified data is detected on the 
system bus before the transaction corresponding to the 
speculative request is detected on the system bus. 

9. The method of claim 8, further comprising: 
validating the speculative request if the transaction cor- 
responding to the speculative request is detected on the 
system bus before a write request corresponding to the 
specified data is detected on the system bus. 

10. The method of claim 8, further comprising: 
retrieving from the system bus snoop information pro- 
vided by one or more central processing units within 
the system bus; and 

selectively canceling the speculative request in response 
to the snoop information. 
11. A computer system having a plurality of central
processing units (CPUs) connected to a system bus, each 
CPU connected to an associated primary memory and com- 
prising: 

a cache;
a cache controller connected to the cache and to the 
system bus; 

a primary memory controller connected to the primary 
memory and to the cache controller, 

wherein the cache controller, in response to a cache miss, 
routes an address request directly from the cache to the 
primary memory controller as a speculative read 
request without accessing the system bus, the primary 
memory controller thereafter validating the speculative
read request using data coherency information 
retrieved from the system bus. 

12. The system of claim 11, wherein the address request 
comprises a transaction ID indicating which of the plurality
of CPUs issued the address request.

13. The system of claim 11, wherein the speculative read 
request comprises one or more valid bits indicating whether 
the speculative read request is validated by the data coher- 
ency information. 

14. The system of claim 11, wherein the primary memory
controller further comprises a memory read queue to queue 
the speculative read request. 
15. The system of claim 14, wherein the memory read
queue comprises: 

a CAM portion to store physical addresses for corre- 
sponding address requests to the primary memory; and 

a RAM portion to store one or more valid bits indicative 
of whether corresponding address requests are specu- 
lative requests. 

16. The system of claim 11, wherein the cache comprises 
an external cache. 

17. The system of claim 11, wherein the primary memory 
controller further comprises a write address queue to store 
write requests. 

18. The method of claim 1, wherein the selectively 
validating comprises: 

validating the speculative read request if the address 
request is received from the system bus before a write 
request for the same data is detected on the system bus. 

19. The method of claim 1, wherein the selectively 
validating further comprises: 

canceling the speculative read request if a write request 
for the same data is detected on the system bus before 
the address request is received from the system bus. 

***** 